Abstract

Very recently crowdsourcing has become the de facto platform for distributing and collecting human
computation for a wide range of tasks and applications such as information retrieval, natural language
processing and machine learning. Current crowdsourcing platforms have some limitations in the area of
quality control. Most of the effort to ensure good quality has to be done by the experimenter who has to
manage the number of workers needed to reach good results.

We propose a simple model for adaptive quality control in crowdsourced multiple-choice tasks which
we call the bandit survey problem. This model is related to, but technically different from the well-known
multi-armed bandit problem. We present several algorithms for this problem, and support them
with analysis and simulations. Our approach is based on our experience conducting relevance evaluation
for a large commercial search engine.

1 Introduction

In recent years there has been a surge of interest in automated methods for crowdsourcing: a distributed
model for problem-solving and experimentation that involves broadcasting the problem or parts thereof
to multiple independent, relatively inexpensive workers and aggregating their solutions. Automation and
optimization of this process at a large scale makes it possible to significantly reduce the costs associated
with setting up, running, and analyzing the experiments. Crowdsourcing is finding applications across a
wide range of domains in information retrieval, natural language processing and machine learning.

A typical crowdsourcing workload is partitioned into microtasks (also called Human Intelligence Tasks),
where each microtask has a specific, simple structure and involves only a small amount of work. Each worker
is presented with multiple microtasks of the same type, to save time on training. The rigidity and simplicity
of the microtasks' structure ensures consistency across multiple microtasks and across multiple workers.

An important industrial application of crowdsourcing concerns web search. One specific goal in this 
domain is relevance assessment: assessing the relevance of search results. One popular task design involves 
presenting a microtask in the form of a query along with the results from the search engine. The worker has 
to answer one question about the relevance of the query to the result set. Experiments such as these are used 
to evaluate the performance of a search engine, construct training sets, and discover queries which require 
more attention and potential algorithmic tuning. 

Stopping / selection issues. The most basic experimental design issue for crowdsourcing is the stopping 
issue: determining how many workers the platform should use for a given microtask before it stops and 
outputs the aggregate answer. The workers in a crowdsourcing environment are not very reliable, so multiple
workers are usually needed to ensure a sufficient confidence level. There is an obvious tradeoff here: using
more workers naturally increases the confidence of the aggregate result but it also increases the cost and time
associated with the experiment. One fairly common heuristic is to use fewer workers if the microtasks seem
easy, and more workers if the microtasks seem hard. However, finding a sweet spot may be challenging,
especially if different microtasks have different degrees of difficulty. 

Whenever one can distinguish between workers, we have a more nuanced selection issue: which workers 
to choose for a given microtask? The workers typically come from a large, loosely managed population. 
Accordingly, the skill levels vary over the population, and are often hard to predict in advance. Further, the 
relative skill levels among workers may depend significantly on a particular microtask or type of microtasks. 
Despite this uncertainty, it is essential to choose workers that are suitable or cost-efficient for the microtask
at hand, to the degree of granularity allowed by the crowdsourcing platform. For example, while targeting
individual workers may be infeasible, one may be able to select some of the workers' attributes such as age
range, gender, country, or education level. Also, the crowdsourcing platform may give access to multiple
third-party providers of workers, and allow one to select among those.

Our focus. This paper is concerned with a combination of the stopping / selection issues discussed above. 
We are looking for a clean setting so as to understand these issues at a more fundamental level. 

We focus on the scenario where several different populations of workers are available and can be targeted 
by the algorithm. As explained above, these populations may correspond to different selections of workers' 
attributes, or to multiple available third-party providers. We will refer to such populations as crowds. We 
assume that the quality of each crowd depends on a particular microtask, and is not known in advance. 

Each microtask is processed by an online algorithm which can adaptively decide which crowd to ask 
next. Informally, the goal is to target the crowds that are most suitable for this microtask. Eventually the
algorithm must stop and output the aggregate answer. 

This paper focuses on processing a single microtask. This allows us to simplify the setting: we do not 
need to model how the latent quantities are correlated across different microtasks, and how the decisions 
and feedback for different microtasks are interleaved over time. Further, we separate the issue of learning
the latent quality of a crowd for a given microtask from the issue of learning the (different but correlated) 
quality parameters of this crowd across multiple microtasks. While the ideas developed in this paper extend 
to multiple microtasks, fleshing out this extension is left to future work. 

Our model: the bandit survey problem. We consider microtasks that are multiple-choice questions: one
needs to choose among a given set $O$ of possible answers (henceforth called options). Informally, the
microtask has a unique correct answer $x^* \in O$, and the high-level goal of the algorithm is to find it.

The algorithm has access to several crowds: populations of workers. Each crowd $i$ is represented by a
distribution $\mathcal{D}_i$ over $O$, called the response distribution of this crowd. We assume that all crowds agree on
the correct answer:¹ some option $x^* \in O$ is the unique most probable option for each $\mathcal{D}_i$.

In each round $t$, the algorithm picks some crowd $i = i_t$ and receives an independent sample from the
corresponding response distribution $\mathcal{D}_{i_t}$. Eventually the algorithm must stop and output its guess for $x^*$.
Each crowd $i$ has a known per-round cost $c_i$. The algorithm has two objectives to minimize: the total cost
$\sum_t c_{i_t}$ and the error rate: the probability that it makes a mistake, i.e. outputs an option other than $x^*$.

The independent sample in the above model abstracts the following interaction between the algorithm 
and the platform: the platform supplies a worker from the chosen crowd, the algorithm presents the microtask
to this worker, and the worker picks some option.
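
The round-by-round interaction described above can be sketched as a small simulation loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `run_bandit_survey`, the `select_crowd` and `stopping_rule` callbacks, and the `max_rounds` safety cap are hypothetical, and the response distributions are passed in explicitly even though they are latent in a real deployment.

```python
import random


def run_bandit_survey(response_dists, costs, select_crowd, stopping_rule, rng,
                      max_rounds=10_000):
    """Simulate the processing of a single microtask (illustrative sketch).

    response_dists[i] maps each option to its probability under crowd i.
    select_crowd(history) returns the index of the crowd to ask next.
    stopping_rule(history) returns None to continue, or the option to output.
    """
    history = []       # (crowd, option) pairs observed so far
    total_cost = 0.0
    for _ in range(max_rounds):
        i = select_crowd(history)
        options, probs = zip(*response_dists[i].items())
        # One independent sample from crowd i: a single worker's answer.
        answer = rng.choices(options, weights=probs, k=1)[0]
        history.append((i, answer))
        total_cost += costs[i]
        guess = stopping_rule(history)
        if guess is not None:
            return guess, total_cost
    return None, total_cost  # hypothetical cap reached without stopping
```

Any crowd-selection policy and any stopping rule with these signatures can be plugged in, which mirrors the two objectives of the model: the returned `total_cost` and whether the returned guess equals the correct answer.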

¹ Otherwise the algorithm's high-level goal is less clear. We chose to avoid this complication in the current version.

Alternative interpretation. The crowds can correspond not to different populations of workers but to
different ways of presenting the same microtask. For example, one could vary the instructions, the order in 
which the options are presented, the fonts and the styles, and the accompanying images. 

The name of the game. Our model is similar to the extensively studied multi-armed bandit problem 
(henceforth, MAB) in that in each round an algorithm selects one alternative from a fixed and known set 
of available alternatives, and the feedback depends on the chosen alternative. However, while an MAB 
algorithm collects rewards, an algorithm in our model collects a survey of workers' opinions. Hence we 
name our model the "bandit survey problem." 

Discussion of the model. The bandit survey problem belongs to a broad class of online decision problems
with explore-exploit tradeoff: that is, the algorithm faces a tradeoff between collecting information (exploration)
and taking advantage of the information gathered so far (exploitation). The paradigmatic problem
in this class is MAB: in each round an algorithm picks one alternative (arm) from a given set of arms, and
receives a randomized, time-dependent reward associated with this arm; the goal is to maximize the total 
reward over time. Most papers on explore-exploit tradeoff concern MAB and its variants. 

The bandit survey problem is different from MAB in several key respects. First, the feedback is different: 
the feedback in MAB is the reward for the chosen alternative, whereas in our setting the feedback is the 
opinion of a worker from the chosen crowd. While the information received by a bandit survey algorithm 
can be interpreted as a "reward", the value of such reward is not revealed to the algorithm and moreover not 
explicitly defined. Second, the algorithm's goal is different: the goal in MAB is to maximize the total reward 
over time, whereas the goal in our setting is to output the correct answer. Third, in our setting there are two 
types of "alternatives": crowds and options in the microtask. Apart from repeatedly selecting between the 
crowds, a bandit survey algorithm needs to output one option: the aggregate answer for the microtask. 

An interesting feature of the bandit survey problem is that an algorithm for this problem consists of two 
components: a crowd-selection algorithm - an online algorithm that decides which crowd to ask next, and 
a stopping rule which decides whether to stop in a given round and which option to output as the aggregate 
answer. These two components are, to a large extent, independent from one another: as long as they do 
not explicitly communicate with one another (or otherwise share a common communication protocol) any 
crowd-selection algorithm can be used in conjunction with any stopping rule.²

Our approach: independent design. Our approach is to design crowd-selection algorithms and stopping 
rules independently from one another. We make this design choice in order to make the overall algorithm 
design task more tractable. While this is not the only possible design choice, we find it productive, as it 
leads to a solid theoretical framework and algorithms that are practical and theoretically founded. 

Given this "independent design" approach, one needs to define the design goals for each of the two 
components. These goals are not immediately obvious. Indeed, two stopping rules may compare differently
depending on the problem instance and the crowd-selection algorithms they are used with. Likewise,
two crowd-selection algorithms may compare differently depending on the problem instance and the stopping
rules they are used with. Therefore the notions of optimal stopping rule and optimal crowd-selection
algorithm are not immediately well-defined.

We resolve this conundrum as follows. We design crowd-selection algorithms that work well across 
a wide range of stopping rules. For a fair comparison between crowd-selection algorithms, we use them 
with the same stopping rule (see Section 5 for details), and argue that such comparison is consistent across
different stopping rules. 

² The no-communication choice is quite reasonable: in fact, it can be complicated to design a reasonable bandit survey algorithm
that requires explicit communication between the crowd-selection algorithm and a stopping rule.

Our contributions. We introduce the bandit survey problem and present initial results in several directions:
benchmarks, algorithms, theoretical analysis, and experiments. 

We are mainly concerned with the design of crowd-selection algorithms. Our crowd-selection algorithms 
work with arbitrary stopping rules. While we provide a specific (and quite reasonable) family of stopping 
rules for concreteness, third-party stopping rules can be easily plugged in. 

For the theoretical analysis of crowd-selection algorithms, we use a standard benchmark: the best time-invariant
policy given all the latent information. The literature on online decision problems typically studies
a deterministic version of this benchmark: the best fixed alternative (in our case, the best fixed crowd). We 
call it the deterministic benchmark. We also consider a randomized version, whereby an alternative (crowd) 
is selected independently from the same distribution in each round; we call it the randomized benchmark. 
The technical definition of the benchmarks, as discussed in Section 5, roughly corresponds to equalizing the
worst-case error rates and comparing costs. 

The specific contributions are as follows. 

(1) We largely solve the bandit survey problem as far as the deterministic benchmark is concerned. 
We design two crowd-selection algorithms, obtain strong provable guarantees, and show that they perform 
well in experiments. It should be emphasized that our provable guarantees hold for an arbitrary stopping 
rule (under a mild assumption). In particular, if all crowds have the same costs we prove that our algorithm 
approaches the deterministic benchmark up to a small additive factor. 

For comparison, we consider a naive crowd-selection algorithm that tries each crowd in a round-robin 
fashion. We prove that this algorithm, and more generally any crowd-selection algorithm that does not 
adapt to the observed workers' responses, performs very badly against the deterministic benchmark. While 
one expects this on an intuitive level, the corresponding mathematical statement is not easy to prove. In 
experiments, our proposed crowd-selection algorithms perform much better than the naive approach. 

(2) We observe that the randomized benchmark dramatically outperforms the deterministic benchmark 
on some problem instances. This is a very unusual property for an online decision problem.³ However, we
show that the two benchmarks coincide when there are only two possible answers. 

We design an algorithm which beats the deterministic benchmark on some problem instances (while not 
quite reaching the randomized benchmark). This appears to be the first published result in the literature on 
online decision problems where an algorithm provably improves over the deterministic benchmark. 

(3) We provide a specific stopping rule for concreteness; this stopping rule is simple, tunable, has 
nearly optimal theoretical guarantees (in a certain formal sense), and works well in experiments. 

Further discussion. The bandit survey problem has a bi-criteria objective: the total cost and the error rate. 
In a typical application, the customer is willing to tolerate a certain error rate, and wishes to minimize the 
total cost as long as the error rate is below this threshold. However, as the error rate depends on the problem 
instance, there are many different ways to make this objective formal. Indeed, one could consider the worst-case
error rate (the maximum over all problem instances), a typical error rate (the expectation over a given
"typical" distribution over problem instances), or a more nuanced notion such as the maximum over a given
family of "typical" distributions. Note that the "worst-case" guarantees may be overly pessimistic, whereas 
considering "typical" distributions makes sense only if one knows what these distributions are. 

An alternative objective is to assign a monetary penalty to a mistake, and optimize the overall cost, i.e.
the cost of labor minus the penalty. However, it may be exceedingly difficult for a customer to assign such
a monetary penalty,⁴ whereas it is typically feasible to specify tolerable error rates. While we think that both
alternatives are worth studying, we chose to follow the bi-criteria objective in this paper.

³ We are aware of only one published example of an online decision problem with this property, in a very different context of
dynamic pricing [5]. However, the results in [5] focus on a special case where the two benchmarks essentially coincide.



2 Related work 

For general background on crowdsourcing and human computation, refer to the book [21]. Most of the work
on crowdsourcing is usually done using platforms like Amazon Mechanical Turk or CrowdFlower. Results
using those platforms have shown that majority voting is a good approach to achieve quality [26]. Get
Another Label [25] explores adaptive schemes for the single-crowd case under Bayesian assumptions (while
our focus is on multiple crowds and regret under non-Bayesian uncertainty). A study on machine translation
quality uses preference voting for combining ranked judgments [0. Vox Populi [12] suggests to prune low 
quality workers; however, their approach is not adaptive and their analysis does not provide regret bounds
(while our focus is on adaptively choosing which crowds to exploit and obtaining regret bounds against an 
optimal algorithm that knows the quality of each crowd). Budget-Optimal Task Allocation liTTl focuses on 
a non-adaptive solution to the task allocation problem given a prior distribution on both tasks and judges
(while we focus on adaptive solutions and do not assume priors on judges or tasks). From a methodology
perspective, CrowdSynth focuses on addressing consensus tasks by leveraging supervised learning [16].
Adding a crowdsourcing layer as part of a computation engine is a very recent line of research. An example
is CrowdDB, a system for crowdsourcing which includes human computation for processing queries [13].
CrowdDB offers basic quality control features, but we expect adoption of more advanced techniques as
those systems become more available within the community.

Multi-armed bandits (MAB) have a rich literature in Statistics, Operations Research, Computer Science 
and Economics. A proper discussion of this literature is beyond our scope; see [9] for background. Most
relevant to our setting is the work on prior-free MAB with stochastic rewards: [20, 3] and the follow-up
work, and Thompson heuristic [27]. Recent work on Thompson heuristic includes [14, 22, 10, 1]. Our
setting is superficially similar to a version of MAB where the goal is to find the best arm after a fixed period
of exploration (e.g., [23, 2]). The difference is that a bandit survey algorithm needs to pick the correct
option, whereas arms correspond to crowds.

Settings similar to stopping rules for a single crowd (but with somewhat different technical objectives) 
were considered in prior work, e.g. [0 US] IH] El]. 

3 Preliminaries and notation 

There are $k$ crowds and $n$ options (possible answers to the microtask). $O$ denotes the set of all options. An
important special case is uniform costs: all $c_i$ are equal; then the total cost is simply the stopping time.

Fix round $t$ in the execution of a bandit survey algorithm. Let $N_{i,t}$ be the number of rounds before $t$ in
which crowd $i$ has been chosen by the algorithm. Among these rounds, let $N_{i,t}(x)$ be the number of times a
given option $x \in O$ has been chosen by this crowd. The empirical distribution $\hat{\mathcal{D}}_{i,t}$ for crowd $i$ is given by
$\hat{\mathcal{D}}_{i,t}(x) = N_{i,t}(x) / N_{i,t}$ for each option $x$. We use $\hat{\mathcal{D}}_{i,t}$ to approximate the (latent) response distribution $\mathcal{D}_i$.

Define the gap $\epsilon(\mathcal{D})$ of a finite-support probability distribution $\mathcal{D}$ as the difference between the largest
and the second-largest probability values in $\mathcal{D}$. If there are only two options ($n = 2$), the gap of a distribution
over $O$ is simply the bias towards the correct answer. Let $\epsilon_i = \epsilon(\mathcal{D}_i)$ and $\hat{\epsilon}_{i,t} = \epsilon(\hat{\mathcal{D}}_{i,t})$ be, respectively, the
gap and the empirical gap of crowd $i$.

⁴ In particular, this was the case in the authors' collaboration with a commercial crowdsourcing platform.

We will use vector notation over crowds: the cost vector $\vec{c} = (c_1, \ldots, c_k)$, the gap vector
$\vec{\epsilon} = (\epsilon_1, \ldots, \epsilon_k)$, and the response vector $\vec{\mathcal{D}}(x) = (\mathcal{D}_1(x), \ldots, \mathcal{D}_k(x))$ for each option $x \in O$.
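
The empirical distribution and the gap defined in this section are straightforward to compute from a list of observed answers. The following is a minimal sketch; the function names `empirical_distribution` and `gap` are our own, and treating a distribution with a single option as having a zero runner-up probability is an assumption made for convenience.

```python
from collections import Counter


def empirical_distribution(answers, options):
    """Empirical distribution of one crowd: N(x) / N for each option x."""
    if not answers:
        raise ValueError("the crowd has not been asked yet")
    counts = Counter(answers)
    n = len(answers)
    return {x: counts[x] / n for x in options}


def gap(dist):
    """Largest probability minus second-largest probability."""
    probs = sorted(dist.values(), reverse=True)
    # Assumption: with a single option, the runner-up probability is 0.
    runner_up = probs[1] if len(probs) > 1 else 0.0
    return probs[0] - runner_up
```

For example, the answers `a, a, b, a` over options `a, b, c` give the empirical distribution (0.75, 0.25, 0) and hence an empirical gap of 0.5.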

Map of the paper. The rest of the paper is organized as follows. As a warm-up and a foundation, we
consider stopping rules for a single crowd (Section 4). Benchmarks are formally defined in Section 5.
Design of crowd-selection algorithms with respect to the deterministic benchmark is treated in Section 6.
We discuss the randomized benchmark in Section 7; we design and analyze an algorithm for this benchmark
in Section 8. We present our experimental results in Section 9 and Section 10, respectively for a single crowd
and for selection over multiple crowds. We discuss open questions in Section 11. To improve the flow of
the paper, some of the proofs and some of the plots are moved to the Appendix.

4 A warm-up: single-crowd stopping rules 

Consider a special case with only one crowd to choose from. It is clear that whenever a bandit survey 
algorithm decides to stop, it should output the most frequent option in the sample. Therefore the algorithm 
reduces to what we call a single-crowd stopping rule: an online algorithm which in every round inputs an 
option $x \in O$ and decides whether to stop. When multiple crowds are available, a single-crowd stopping
rule can be applied to each crowd separately. This discussion of the single-crowd stopping rules, together 
with the notation and tools that we introduce along the way, forms a foundation for the rest of the paper. 

A single-crowd stopping rule is characterized by two quantities that are to be minimized: the expected 
stopping time and the error rate: the probability that once the rule decides to stop, the most frequent option 
in the sample is not x*. Note that both quantities depend on the problem instance; therefore we leave the 
bi-criteria objective somewhat informal at this point. 

A simple single-crowd stopping rule. We suggest the following single-crowd stopping rule:

Stop if $\hat{\epsilon}_{i,t} \, N_{i,t} > C_{\mathrm{qty}} \sqrt{N_{i,t}}$. (1)

Here $i$ is the crowd the stopping rule is applied to, and $C_{\mathrm{qty}}$ is the quality parameter which indirectly
controls the tradeoff between the error rate and the expected stopping time. Specifically, increasing $C_{\mathrm{qty}}$
decreases the error rate and increases the expected stopping time. If there are only two options, call them $x$
and $y$, then the left-hand side of the stopping rule is simply $|N_{i,t}(x) - N_{i,t}(y)|$.

The right-hand side of the stopping rule is a confidence term, which should be large enough to guarantee
the desired confidence level. The $\sqrt{N_{i,t}}$ is there because the standard deviation of the Binomial distribution
with $N$ samples is proportional to $\sqrt{N}$.

In our experiments, we use a "smooth" version of this stopping rule: we randomly round the confidence
term to one of the two nearest integers. In particular, the smooth version is meaningful even with $C_{\mathrm{qty}} < 1$
(whereas the deterministic version with $C_{\mathrm{qty}} < 1$ always stops after one round).
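
The stopping rule and its smooth variant can be sketched as follows. This is an illustration rather than the authors' code: the name `should_stop` is ours, and the rounding scheme (round up with probability equal to the fractional part of the confidence term) is an assumption, since the text only states that the term is randomly rounded to one of the two nearest integers.

```python
import math
import random


def should_stop(counts, c_qty, rng=None):
    """Single-crowd stopping rule: stop if eps_hat * N > C_qty * sqrt(N).

    counts maps each option to the number of times the crowd chose it.
    eps_hat * N equals the top count minus the runner-up count.
    If rng is given, the "smooth" version is used.
    """
    n = sum(counts.values())
    if n == 0:
        return False
    top = sorted(counts.values(), reverse=True)
    lead = top[0] - (top[1] if len(top) > 1 else 0)
    threshold = c_qty * math.sqrt(n)
    if rng is not None:
        floor = math.floor(threshold)
        # Assumed scheme: round up with probability equal to the fractional part.
        threshold = floor + (1 if rng.random() < threshold - floor else 0)
    return lead > threshold
```

With `c_qty = 0.5` and a single observation, the deterministic version always stops, whereas under the assumed rounding the smooth version stops only about half of the time, which is what makes small values of the quality parameter meaningful.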

Analysis. We argue that the proposed single-crowd stopping rule is quite reasonable. To this end, we 
obtain a provable guarantee on the tradeoff between the expected stopping time and the worst-case error 
rate. Further, we prove that this guarantee is nearly optimal across all single-crowd stopping rules. Both 
results above are in terms of the gap of the crowd that the stopping rule interacts with. We conclude that the 
gap is a crucial parameter for the bandit survey problem. 

Theorem 4.1. Consider the stopping rule (1) with $C_{\mathrm{qty}} = \sqrt{\log\left(\tfrac{n}{\delta} N_{i,t}^2\right)}$, for some $\delta > 0$. The error rate of
this stopping rule is at most $O(\delta)$, and the expected stopping time is at most $O\left(\epsilon_i^{-2} \log \tfrac{n}{\delta \epsilon_i}\right)$.

The proof of Theorem 4.1, and several other proofs in the paper, rely on the Azuma-Hoeffding inequality.
More specifically, we use the following corollary: for each $C > 0$, each round $t$, and each option $x \in O$,

$\Pr\left[ \, |\mathcal{D}_i(x) - \hat{\mathcal{D}}_{i,t}(x)| \le C/\sqrt{N_{i,t}} \, \right] \ge 1 - e^{-\Omega(C^2)}$. (2)

In particular, taking the Union Bound over all options $x \in O$, we obtain:

$\Pr\left[ \, |\hat{\epsilon}_{i,t} - \epsilon_i| \le C/\sqrt{N_{i,t}} \, \right] \ge 1 - n \, e^{-\Omega(C^2)}$. (3)

Proof of Theorem 4.1. Fix $a \ge 1$ and let $C_t = \sqrt{\log\left(a \tfrac{n}{\delta} N_{i,t}^2\right)}$. Let $\mathcal{E}_{x,t}$ be the event in Equation (2) with
$C = C_t$. Consider the event that $\mathcal{E}_{x,t}$ holds for all options $x \in O$ and all rounds $t$; call it the clean event.
Taking the Union Bound, we see that the clean event holds with probability at least $1 - O(\delta/a)$.

First, assuming the clean event we have $|\epsilon_i - \hat{\epsilon}_{i,t}| \le 2 C_t/\sqrt{N_{i,t}}$ for all rounds $t$. Then the stopping
rule (1) stops as soon as $\epsilon_i \ge 3 C_t/\sqrt{N_{i,t}}$, which happens as soon as $N_{i,t} = O\left(\epsilon_i^{-2} \log \tfrac{a n}{\delta \epsilon_i}\right)$. Integrating
this over all $a \ge 1$, we derive that the expected stopping time is as claimed.

Second, take $a = 1$ and assume the clean event. Suppose the stopping rule stops at some round $t$. Let $x$
be the most probable option after this round. Then $\hat{\mathcal{D}}_{i,t}(x) - \hat{\mathcal{D}}_{i,t}(y) \ge C_t/\sqrt{N_{i,t}}$ for all options $y \ne x$. It
follows that $\mathcal{D}_i(x) > \mathcal{D}_i(y)$ for all options $y \ne x$, i.e. $x$ is the correct answer. □

The following lower bound easily follows from classical results on coin-tossing. Essentially, one needs
at least $\Omega(\epsilon^{-2})$ samples from a crowd with gap $\epsilon > 0$ to obtain the correct answer.

Theorem 4.2. Let $R_0$ be any single-crowd stopping rule with worst-case error rate less than $\delta$. When
applied to a crowd with gap $\epsilon > 0$, the expected stopping time of $R_0$ is at least $\Omega(\epsilon^{-2} \log \tfrac{1}{\delta})$.

While the upper bound in Theorem 4.1 is close to the lower bound in Theorem 4.2, it is possible that one
can obtain a more efficient version of Theorem 4.1 using more sophisticated versions of the Azuma-Hoeffding
inequality such as, for example, the Empirical Bernstein Inequality.

Stopping rules for multiple crowds. For multiple crowds, we consider stopping rules that are composed
of multiple instances of a given single-crowd stopping rule $R_0$; we call them composite stopping rules.
Specifically, we have one instance of $R_0$ for each crowd (which only inputs answers from this crowd), and
an additional instance of $R_0$ for the total crowd: the entire population of workers. The composite stopping
rule stops as soon as some instance of $R_0$ stops, and outputs the majority option for this instance.⁵ Given a
crowd-selection algorithm $A$, let $\mathrm{cost}(A|R_0)$ denote its expected total cost (for a given problem instance)
when run together with the composite stopping rule based on $R_0$.
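
A composite stopping rule can be sketched by keeping one tally per crowd plus one tally for the total crowd, and applying the same single-crowd rule to each. The code below is a simplified illustration with hypothetical names (`single_crowd_stop`, `composite_stop`); it assumes the rule is evaluated after every round, uses a deterministic square-root rule as the single-crowd rule, and returns the first stopped instance instead of breaking ties uniformly at random.

```python
import math
from collections import Counter


def single_crowd_stop(counts, c_qty=2.0):
    """Deterministic single-crowd rule: lead over runner-up > C_qty * sqrt(N)."""
    n = sum(counts.values())
    if n == 0:
        return False
    top = counts.most_common(2)
    lead = top[0][1] - (top[1][1] if len(top) > 1 else 0)
    return lead > c_qty * math.sqrt(n)


def composite_stop(history, num_crowds, rule=single_crowd_stop):
    """Return the majority option of a stopped instance, or None to continue.

    history is a list of (crowd, option) pairs observed so far.
    """
    per_crowd = [Counter() for _ in range(num_crowds)]
    total = Counter()
    for crowd, option in history:
        per_crowd[crowd][option] += 1
        total[option] += 1
    # One instance per crowd, plus one instance for the total crowd.
    for counts in per_crowd + [total]:
        if rule(counts):
            return counts.most_common(1)[0][0]
    return None
```

The total-crowd instance matters when no individual crowd has been sampled often enough to stop on its own, but the pooled answers already agree strongly.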



5 Omniscient benchmarks for crowd selection 

We consider two "omniscient" benchmarks for crowd-selection algorithms: informally, the best fixed crowd
$i^*$ and the best fixed distribution $\mu^*$ over crowds, where $i^*$ and $\mu^*$ are chosen given the latent information:
the response distributions of the crowds. Both benchmarks treat all their inputs as a single data source, and
are used in conjunction with a given single-crowd stopping rule $R_0$ (and hence depend on $R_0$).

⁵ If $R_0$ is randomized, then each instance of $R_0$ uses an independent random seed. If multiple instances of $R_0$ stop at the same
time, the aggregate answer is chosen uniformly at random among the majority options for the stopped instances.

Deterministic benchmark. Let $\mathrm{cost}(i|R_0)$ be the expected total cost of always choosing crowd $i$, with $R_0$
as the stopping rule. We define the deterministic benchmark as the crowd $i^*$ that minimizes $\mathrm{cost}(i|R_0)$ for a
given problem instance. In view of the analysis in Section 4, our intuition is that $\mathrm{cost}(i|R_0)$ is approximated
by $c_i/\epsilon_i^2$ up to a constant factor (where the factor may depend on $R_0$ but not on the response distribution of
the crowd). The exact identity of the best crowd may depend on $R_0$. For the basic special case of uniform
costs and two options (assuming that the expected stopping time of $R_0$ is non-increasing in the gap), the
best crowd is the crowd with the largest gap. In general, we approximate the best crowd by $\arg\min_i c_i/\epsilon_i^2$.
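
As a small worked illustration of this approximation, the following sketch picks the crowd minimizing the ratio of cost to squared gap. The function name `approx_best_crowd` is hypothetical, and the gaps are supplied directly even though they are latent in practice; a crowd with zero gap is assigned infinite approximate cost.

```python
def approx_best_crowd(costs, gaps):
    """Index of the crowd minimizing c_i / eps_i^2 (approximate benchmark)."""
    scores = [c / (e * e) if e > 0 else float("inf")
              for c, e in zip(costs, gaps)]
    return min(range(len(scores)), key=scores.__getitem__)
```

With uniform costs this reduces to choosing the largest gap. With costs (1, 10) and gaps (0.2, 0.3) the scores are 25 and roughly 111, so the cheaper crowd is preferred despite its smaller gap.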

Randomized benchmark. Given a distribution $\mu$ over crowds, let $\mathrm{cost}(\mu|R_0)$ be the expected total cost of
a crowd-selection algorithm that in each round chooses a crowd independently from $\mu$, treats all inputs as
a single data source (essentially, a single crowd) and uses $R_0$ as a stopping rule on this data source. The
randomized benchmark is defined as the $\mu$ that minimizes $\mathrm{cost}(\mu|R_0)$ for a given problem instance. This
benchmark is further discussed in Section 7.
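
To see why a distribution over crowds can behave differently from any single crowd, note that choosing a crowd at random from a fixed distribution and pooling the answers yields a single data source whose response distribution is the corresponding mixture of the crowds' response distributions. The sketch below computes this mixture and its gap on a made-up three-option instance; the numbers and the names `mixture` and `gap` are ours, chosen purely for illustration.

```python
def mixture(mu, dists):
    """Response distribution of the pooled data source under crowd weights mu."""
    options = dists[0].keys()
    return {x: sum(m * d[x] for m, d in zip(mu, dists)) for x in options}


def gap(dist):
    """Largest probability minus second-largest probability."""
    top = sorted(dist.values(), reverse=True)
    return top[0] - top[1]


# Hypothetical instance: both crowds agree that "x*" is most probable,
# but they disagree on which wrong option is the runner-up.
d1 = {"x*": 0.40, "a": 0.35, "b": 0.25}
d2 = {"x*": 0.40, "a": 0.25, "b": 0.35}
mixed = mixture([0.5, 0.5], [d1, d2])
```

Here each crowd alone has gap 0.05, while the uniform mixture has gap 0.10, because the mass on the two wrong options is evened out. With two options this effect cannot occur: the gap of a mixture is then the weighted average of the individual gaps.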

Comparison against the benchmarks. In the analysis, we compare a given crowd-selection algorithm $A$
against these benchmarks as follows: we use $A$ in conjunction with the composite stopping rule based on
$R_0$, and compare the expected total cost $\mathrm{cost}(A|R_0)$ against those of the benchmarks.

Let us argue that using the same $R_0$ roughly equalizes the worst-case error rate between $A$ and the
benchmarks. Let $\rho$ be the worst-case error rate of $R_0$, and assume it is achieved for gap $\epsilon$. Then the worst-case
error rate for both benchmarks is $\rho$; it is achieved on a problem instance in which all crowds have gap
$\epsilon$. It is easy to see that the worst-case error rate of $A$ is at most $(k + 1)\rho$, where $k$ is the number of crowds.

We also conjecture that the worst-case error rate for $A$ is at least $\rho$. In the Appendix (Lemma A.2), we
prove a slightly weaker result: essentially, if the composite stopping rule does not use the total crowd, then
the worst-case error rate for $A$ is at least $\rho \, (1 - 2k\rho)$.

6 Crowd selection against the deterministic benchmark 

In this section we design crowd-selection algorithms that compete with the deterministic benchmark. 

Throughout the section, let $R_0$ be a fixed single-parameter stopping rule. Recall that the deterministic
benchmark is defined as $\min_i \mathrm{cost}(i|R_0)$, where the minimum is over all crowds $i$. We consider arbitrary
composite stopping rules based on $R_0$, under a mild assumption that $R_0$ does not favor one option over
another. Formally, we assume that the probability that $R_0$ stops at any given round, conditional on any
fixed history (sequence of observations that $R_0$ inputs before this round), does not change if the options are
permuted. Then $R_0$ and the corresponding composite stopping rule are called symmetric. For the case of
two options (when the expected stopping time of $R_0$ depends only on the gap of the crowd that $R_0$ interacts
with) we sometimes make another mild assumption: that the expected stopping time decreases in the gap;
we call such $R_0$ gap-decreasing.

6.1 Crowd-selection algorithms 

Virtual reward heuristic. Our crowd-selection algorithms are based on the following idea, which we call
the virtual reward heuristic.⁶ Given an instance of the bandit survey problem, consider an MAB instance
where crowds correspond to arms, and selecting each crowd $i$ results in reward $f_i = f(c_i/\epsilon_i^2)$, for some
fixed decreasing function $f$. (We can also plug in a better approximation for $\mathrm{cost}(i|R_0)$ when and if one
is available.) Call $f_i$ the virtual reward; note that it is not directly observed by a bandit survey algorithm,
since it depends on the gap $\epsilon_i$. However, various off-the-shelf bandit algorithms can be restated in terms of
the estimated rewards, rather than the actual observed rewards. The idea is to use such bandit algorithms
and plug in our own estimates for the rewards.

⁶ We thank anonymous reviewers for pointing out that our index-based algorithm can be interpreted via virtual rewards.

A bandit algorithm thus applied would implicitly minimize the number of times suboptimal crowds are 
chosen. This is a desirable by-product of the design goal in MAB, which is to maximize the total (virtual) 
reward. We are not directly interested in this design goal, but we take advantage of the by-product. 

Algorithm 1: UCB1 with virtual rewards. Our first crowd-selection algorithm is based on UCB1 [3], a
standard MAB algorithm. We use virtual rewards $f_i = \epsilon_i/\sqrt{c_i}$.

We observe that UCB1 has the property that at each time $t$ it only requires an estimate of $f_i$ and a confidence
term for this estimate. Motivated by Equation Q, we use $\hat\epsilon_{i,t}/\sqrt{c_i}$ as the estimate for $f_i$, and $C/\sqrt{c_i N_{i,t}}$ as
the confidence term. The resulting crowd-selection algorithm, which we call VirtUCB, proceeds as follows.
In each round $t$ it chooses the crowd $i$ which maximizes the index $I_{i,t}$, defined as

$$ I_{i,t} = c_i^{-1/2} \left( \hat\epsilon_{i,t} + C/\sqrt{N_{i,t}} \right). \qquad (4) $$

For the analysis, we use (4) with $C = \sqrt{8 \log t}$. In our experiments, $C = 1$ appears to perform best.
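As an illustration of the selection step, here is a minimal Python sketch. It assumes the index is the estimate plus the confidence term described above, i.e. (empirical gap + C / sqrt(number of samples)) / sqrt(cost). The function names and the convention of trying unsampled crowds first are assumptions of this sketch, not part of the paper.

```python
import math

def virt_ucb_index(emp_gap, cost, n_samples, C=1.0):
    """Index of one crowd: (empirical gap + C / sqrt(#samples)) / sqrt(cost)."""
    if n_samples == 0:
        # Assumption of this sketch: a crowd with no samples is tried first.
        return math.inf
    return (emp_gap + C / math.sqrt(n_samples)) / math.sqrt(cost)

def virt_ucb_choose(emp_gaps, costs, samples, C=1.0):
    """Return the position of the crowd with the largest index."""
    indices = [virt_ucb_index(g, c, n, C)
               for g, c, n in zip(emp_gaps, costs, samples)]
    return max(range(len(indices)), key=indices.__getitem__)
```

For example, `virt_ucb_choose([0.3, 0.1], [1.0, 1.0], [10, 10])` picks the first crowd, since both crowds have the same confidence term and the first has the larger empirical gap.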

Algorithm 2: Thompson heuristic. Our second crowd-selection algorithm, called VirtThompson, is an
adaptation of the Thompson heuristic [27] for MAB to virtual rewards $f_i = \epsilon_i/\sqrt{c_i}$. The algorithm proceeds as
follows. For each round $t$ and each crowd $i$, let $\mathcal{P}_{i,t}$ be the Bayesian posterior distribution for the gap $\epsilon_i$ given
the observations from crowd $i$ up to round $t$ (starting from the uniform prior). Sample $\zeta_i$ independently from
$\mathcal{P}_{i,t}$. Pick the crowd with the largest index $\zeta_i/\sqrt{c_i}$. As in UCB1, the index of crowd $i$ is chosen from the
confidence interval for the (virtual) reward of this crowd, but here it is a random sample from this interval,
whereas in UCB1 it is the upper bound.

It appears difficult to compute the posteriors $\mathcal{P}_{i,t}$ exactly, so in practice an approximation can be used.
In our simulations we focus on the case of two options, call them $x, y$. For each crowd $i$ and round $t$, we
approximate $\mathcal{P}_{i,t}$ by the Beta distribution with shape parameters $\alpha = 1 + N_{i,t}(x)$ and $\beta = 1 + N_{i,t}(y)$,
where $N_{i,t}(x) \ge N_{i,t}(y)$. (Essentially, we ignore the possibility that $x$ is not the right answer.)
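The two-option approximation can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: it treats the Beta sample as the probability p of the majority option and converts it to a gap via 2p - 1, and the name `virt_thompson_choose` is hypothetical.

```python
import math
import random

def virt_thompson_choose(counts, costs, rng=random):
    """counts[i] = (votes for one option, votes for the other) from crowd i."""
    best, best_index = None, -math.inf
    for i, ((n_a, n_b), cost) in enumerate(zip(counts, costs)):
        hi, lo = max(n_a, n_b), min(n_a, n_b)  # majority option plays the role of x
        p = rng.betavariate(1 + hi, 1 + lo)    # Beta(1 + N(x), 1 + N(y)) sample
        gap_sample = 2.0 * p - 1.0             # assumption: gap = p - (1 - p)
        index = gap_sample / math.sqrt(cost)   # virtual reward for this sample
        if index > best_index:
            best, best_index = i, index
    return best
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.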

It is not clear how the posterior $\mathcal{P}_{i,t}$ in our problem corresponds to the one in the original MAB problem,
so we cannot directly invoke the analyses of the Thompson heuristic for MAB [10, 1].

Straw-man approaches. We compare the two algorithms presented above to an obvious naive approach:
iterate through each crowd in a round-robin fashion. More precisely, we consider a slightly more refined
version where in each round the crowd is sampled from a fixed distribution $\mu$ over crowds. We will call such
algorithms non-adaptive. The most reasonable version, called RandRR (short for "randomized round-robin"),
is to sample each crowd $i$ with probability $\mu_i \sim 1/c_i$.$^{7}$

$^{7}$ For uniform costs it is natural to use a uniform distribution for $\mu$. For non-uniform costs our choice is motivated by Theorem 6.3,
where it (approximately) minimizes the competitive ratio.

In the literature on MAB, more sophisticated algorithms are often compared to the basic approach: first
explore, then exploit. In our context this means to first explore until we can identify the best crowd, then
pick this crowd and exploit. So for the sake of comparison we also develop a crowd-selection algorithm that
is directly based on this approach. (This algorithm is not based on the virtual rewards.) In our experiments
we find it vastly inferior to VirtUCB and VirtThompson.

The "explore, then exploit" design does not quite work as is: selecting the best crowd with high probability
seems to require a high-probability guarantee that this crowd can produce the correct answer with the
current data, in which case there is no need for a further exploitation phase (and so we are essentially back
to RandRR). Instead, our algorithm explores until it can identify the best crowd with low confidence, then
it exploits with this crowd until it sufficiently boosts the confidence or until it realizes that it has selected
a wrong crowd to exploit. The latter possibility necessitates a third phase, called rollback, in which the
algorithm explores until it finds the right answer with high confidence.

The algorithm assumes that the single-crowd stopping rule $R_0$ has a quality parameter $C_{\mathrm{qty}}$ which controls
the trade-off between the error rate and the expected running time (as in Section 4). In the exploration
phase, we also use a low-confidence version of $R_0$ that is parameterized with a lower value $C'_{\mathrm{qty}} < C_{\mathrm{qty}}$;
we run one low-confidence instance of $R_0$ for each crowd.

The algorithm, called ExploreExploitRollback, proceeds in three phases (and stops whenever the
composite stopping rule decides so). In the exploration phase, it runs RandRR until the low-confidence
version of $R_0$ stops for some crowd $i^*$. In the exploitation phase, it always chooses crowd $i^*$. This phase
lasts $\alpha$ times as long as the exploration phase, where the parameter $\alpha$ is chosen so that crowd $i^*$ produces a
high-confidence answer w.h.p. if it is indeed the best crowd.$^{8}$ Finally, in the roll-back phase it runs RandRR.

6.2 Analysis: upper bounds 

We start with a lemma that captures the intuition behind the virtual reward heuristic, explaining how it helps 
to minimize the selection of suboptimal crowds. Then we derive an upper bound for VirtUCB. 

Lemma 6.1. Let $i^* = \arg\min_i c_i/\epsilon_i^2$ be the approximate best crowd. Let $R_0$ be a symmetric single-crowd
stopping rule. Then for any crowd-selection algorithm $A$, letting $N_i$ be the number of times crowd $i$ is chosen, we have

$$ \mathrm{cost}(A|R_0) \le \mathrm{cost}(i^*|R_0) + \sum_{i \ne i^*} c_i\, \mathbb{E}[N_i]. $$

This is a non-trivial statement because $\mathrm{cost}(i^*|R_0)$ refers not to the execution of $A$, but to a different
execution in which crowd $i^*$ is always chosen. The proof uses a "coupling argument".

Proof. Let $A^*$ be the crowd-selection algorithm which corresponds to always choosing crowd $i^*$.

To compare $\mathrm{cost}(A|R_0)$ and $\mathrm{cost}(A^*|R_0)$, let us assume w.l.o.g. that the two algorithms are run on
correlated sources of randomness. Specifically, assume that both algorithms are run on the same realization
of answers for crowd $i^*$: the $\ell$-th time they ask this crowd, both algorithms get the same answer. Moreover,
assume that the instance of $R_0$ that works with crowd $i^*$ uses the same random seed for both algorithms.

Let $N$ be the realized stopping time for $A^*$. Then $A$ must stop after crowd $i^*$ is chosen $N$ times. It
follows that the difference in the realized total costs between $A$ and $A^*$ is at most $\sum_{i \ne i^*} c_i N_i$. The claim
follows by taking expectation over the randomness in the crowds and in the stopping rule. □

Theorem 6.2 (VirtUCB). Let $i^* = \arg\min_i c_i/\epsilon_i^2$ be the approximate best crowd. Let $R_0$ be a symmetric
single-crowd stopping rule. Assume $R_0$ must stop after at most $T$ rounds. Define VirtUCB by (4) with
$C = \sqrt{8 \log t}$, for each round $t$. Let $\Lambda_i = (c_i (f_{i^*} - f_i))^{-2}$ and $\Lambda = \sum_{i \ne i^*} \Lambda_i$. Then

$$ \mathrm{cost}(\mathrm{VirtUCB}|R_0) \le \mathrm{cost}(i^*|R_0) + O(\Lambda \log T). $$

Proof Sketch. Plugging $C = \sqrt{8 \log t}$ into Equation ^ and dividing by $\sqrt{c_i}$, we obtain the confidence
bound for $|f_i - \hat\epsilon_{i,t}/\sqrt{c_i}|$ that is needed in the original analysis of UCB1 in [3]. Then, as per that analysis,
it follows that for each crowd $i \ne i^*$ and each round $t$ we have $\mathbb{E}[N_{i,t}] \le \Lambda_i \log t$. (This is also not difficult
to derive directly.) To complete the proof, note that $t \le T$ and invoke Lemma 6.1. □

$^{8}$ We conjecture that for $R_0$ from Section 4 one can take $\alpha = O(C_{\mathrm{qty}}/C'_{\mathrm{qty}})$.

Note that the approximate best crowd $i^*$ may be different from the (actual) best crowd, so the guarantee in
Theorem 6.2 is only as good as the difference $\mathrm{cost}(i^*|R_0) - \min_i \mathrm{cost}(i|R_0)$. Note that $i^*$ is in fact the
best crowd for the basic special case of uniform costs and two options (assuming that $R_0$ is gap-decreasing).

It is not clear whether the constants $\Lambda_i$ can be significantly improved. For uniform costs we have
$\Lambda_i = (\epsilon_{i^*} - \epsilon_i)^{-2}$, which is essentially the best one could hope for. This is because one needs to try each
crowd $i \ne i^*$ at least $\Omega(\Lambda_i)$ times to tell it apart from crowd $i^*$.$^{9}$

6.3 Analysis: lower bound for non-adaptive crowd selection 

The purpose of this section is to argue that non-adaptive crowd-selection algorithms perform badly compared
to VirtUCB. We prove that the competitive ratio of any non-adaptive crowd-selection algorithm is bounded
from below by (essentially) the number of crowds. We contrast this with an upper bound on the competitive
ratio of VirtUCB, which we derive from Theorem 6.2.

Here the competitive ratio of algorithm $A$ (with respect to the deterministic benchmark) is defined
as $\max \frac{\mathrm{cost}(A|R_0)}{\min_i \mathrm{cost}(i|R_0)}$, where the outer max is over all problem instances in a given family of problem
instances. We focus on a very simple family: problem instances with two options and uniform costs, in
which one crowd has gap $\epsilon > 0$ and all other crowds have gap 0; we call such instances $\epsilon$-simple.

Our result holds for a version of a composite stopping rule that does not use the total crowd. Note that
considering the total crowd does not, intuitively, make sense for the $\epsilon$-simple problem instances, and we did
not use it in the proof of Theorem 6.2 either.

Theorem 6.3 (RandRR). Let $R_0$ be a symmetric single-crowd stopping rule with worst-case error rate $\rho$.
Assume that the composite stopping rule does not use the total crowd. Consider a non-adaptive crowd-selection
algorithm $A$ whose distribution over crowds is $\mu$. Then for each $\epsilon > 0$, the competitive ratio over
the $\epsilon$-simple problem instances is at least $\frac{\sum_i c_i \mu_i}{\min_i c_i \mu_i}\,(1 - 2k\rho)$, where $k$ is the number of crowds.

Note that $\min \frac{\sum_i c_i \mu_i}{\min_i c_i \mu_i} = k$, where the min is taken over all distributions $\mu$. The minimizing $\mu$ satisfies
$\mu_i \sim 1/c_i$ for each crowd $i$, i.e., $\mu$ corresponds to RandRR. The $(1 - 2k\rho)$ factor could be an artifact of
our somewhat crude method to bound the "contribution" of the gap-0 crowds. We conjecture that this factor
is unnecessary (perhaps under some minor assumptions on $R_0$).
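The claim about the minimizing distribution can be checked numerically. The sketch below assumes the leading factor of the bound is (sum_i c_i mu_i) / (min_i c_i mu_i), which is the reading consistent with the proof; the helper names are hypothetical.

```python
def ratio(mu, costs):
    """(sum_i c_i * mu_i) / (min_i c_i * mu_i) for a distribution mu over crowds."""
    weighted = [c * m for c, m in zip(costs, mu)]
    return sum(weighted) / min(weighted)

def rand_rr_distribution(costs):
    """mu_i proportional to 1 / c_i, normalized to sum to 1."""
    inverse = [1.0 / c for c in costs]
    total = sum(inverse)
    return [v / total for v in inverse]
```

With `mu` proportional to `1 / c_i`, every product `c_i * mu_i` is the same, so the sum is exactly k times the minimum. Any other distribution has a strictly smaller minimum relative to the sum, so its ratio is larger than k.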

To prove Theorem 6.3 we essentially need to compare the stopping time of the composite stopping
rule $R$ with the stopping time of the instance of $R_0$ that works with the gap-$\epsilon$ crowd. The main technical
difficulty is to show that the other crowds are not likely to force $R$ to stop before this $R_0$ instance does. To
this end, we use a lemma that $R_0$ is not likely to stop in finite time when applied to a gap-0 crowd.

Lemma 6.4. Consider a symmetric single-crowd stopping rule $R_0$ with worst-case error rate $\rho$. Suppose
$R_0$ is applied to a crowd with gap 0. Then $\Pr[R_0 \text{ stops in finite time}] \le 2\rho$.

Proof. Intuitively, if $R_0$ stops early when the gap is 0, then it is likely to make a mistake when the gap is very small
but positive. However, connecting the probability in question with the error rate of $R_0$ requires some work.

Suppose $R_0$ is applied to a crowd with gap $\epsilon$. Let $q(\epsilon, t, x)$ be the probability that $R_0$ stops at round $t$
and "outputs" option $x$ (in the sense that by the time $R_0$ stops, $x$ is the majority vote).

We claim that for all rounds $t$ and each option $x$ we have

$$ \lim_{\epsilon \to 0} q(\epsilon, t, x) = q(0, t, x). \qquad (5) $$

Indeed, suppose not. Then for some $\delta > 0$ there exist arbitrarily small gaps $\epsilon > 0$ such that
$|q(\epsilon, t, x) - q(0, t, x)| > \delta$. Thus it is possible to tell apart a crowd with gap 0 from a crowd with gap $\epsilon$ by observing
$\Theta(\delta^{-2})$ independent runs of $R_0$, where each run continues for $t$ steps. In other words, it is possible to tell
apart a fair coin from a gap-$\epsilon$ coin using $\Theta(t\,\delta^{-2})$ "coin tosses", for fixed $t$ and $\delta > 0$ and an arbitrarily
small $\epsilon$. Contradiction. Claim proved.

Let $x$ and $y$ be the two options, and let $x$ be the correct answer. Let $q(\epsilon, t)$ be the probability that $R_0$
stops at round $t$. Let $\alpha(\epsilon|t) = q(\epsilon, t, y)/q(\epsilon, t)$ be the conditional probability that $R_0$ outputs a wrong answer
given that it stops at round $t$. Note that by Equation (5) for each round $t$ it holds that $q(\epsilon, t) \to q(0, t)$ and
$\alpha(\epsilon|t) \to \alpha(0|t)$ as $\epsilon \to 0$. Therefore for each round $t_0 \in \mathbb{N}$ we have:

$$ \rho \;\ge\; \sum_{t \in \mathbb{N}} \alpha(\epsilon|t)\, q(\epsilon, t) \;\ge\; \sum_{t \le t_0} \alpha(\epsilon|t)\, q(\epsilon, t) \;\xrightarrow[\epsilon \to 0]{}\; \sum_{t \le t_0} \alpha(0|t)\, q(0, t). $$

Note that $\alpha(0|t) = \frac{1}{2}$ by symmetry. It follows that $\sum_{t \le t_0} q(0, t) \le 2\rho$ for each $t_0 \in \mathbb{N}$. Therefore the
probability that $R_0$ stops in finite time is $\sum_{t=1}^{\infty} q(0, t) \le 2\rho$. □

$^{9}$ This can be proved using an easy reduction from an instance of the MAB problem where each arm $i$ brings reward 1 with
probability $(1 + \epsilon_i)/2$, and reward 0 otherwise. Treat this as an instance of the bandit survey problem, where arms correspond to
crowds, and options to rewards. An algorithm that finds the crowd with a larger gap in less than $\Omega(\Lambda_i)$ steps would also find an
arm with a larger expected reward, which would violate the corresponding lower bound for the MAB problem (see [4]).


Proof of Theorem 6.3. Suppose algorithm $A$ is applied to an $\epsilon$-simple instance of the bandit survey problem.
To simplify the notation, assume that crowd 1 is the crowd with gap $\epsilon$ (and all other crowds have gap 0).

Let $R_i$ be the instance of $R_0$ that corresponds to a given crowd $i$. Denote the composite stopping rule
by $R$. Let $\sigma_R$ be the stopping time of $R$: the round in which $R$ stops.

For the following two definitions, let us consider an execution of algorithm $A$ that runs forever (i.e., it
keeps running even after $R$ decides to stop). First, let $\tau_i$ be the "local" stopping time of $R_i$: the number of
samples from crowd $i$ that $R_i$ inputs before it decides to stop. Second, let $\sigma_i$ be the "global" stopping time
of $R_i$: the round when $R_i$ decides to stop. Note that $\sigma_R = \min_i \sigma_i$.

Let us use Lemma 6.4 to show that $R$ stops essentially when $R_1$ tells it to stop. Namely:

$$ \mathbb{E}[\sigma_1]\,(1 - 2k\rho) \le \mathbb{E}[\sigma_R]. \qquad (6) $$

To prove Equation (6), consider the event $\mathcal{E} = \{\min_{i>1} \tau_i = \infty\}$, and let $1_{\mathcal{E}}$ be the indicator variable of
this event. Note that $\sigma_R \ge \sigma_1 1_{\mathcal{E}}$ and that random variables $\sigma_1$ and $1_{\mathcal{E}}$ are independent. It follows that
$\mathbb{E}[\sigma_R] \ge \Pr[\mathcal{E}]\, \mathbb{E}[\sigma_1]$. Finally, Lemma 6.4 implies that $\Pr[\mathcal{E}] \ge 1 - 2k\rho$. Claim proved.

Let $i_t$ be the crowd chosen by $A$ in round $t$. Then by Wald's identity we have

$$ \mathbb{E}[\tau_1] = \mathbb{E}\left[\sum_{t=1}^{\sigma_1} 1_{\{i_t = 1\}}\right] = \mathbb{E}[1_{\{i_t = 1\}}]\; \mathbb{E}[\sigma_1] = \mu_1\, \mathbb{E}[\sigma_1], $$

$$ \mathbb{E}[\mathrm{cost}(A|R_0)] = \mathbb{E}\left[\sum_{t=1}^{\sigma_R} c_{i_t}\right] = \mathbb{E}[c_{i_t}]\; \mathbb{E}[\sigma_R] = \left(\sum_i c_i \mu_i\right) \mathbb{E}[\sigma_R]. $$

Therefore, plugging in Equation (6), we obtain

$$ \frac{\mathbb{E}[\mathrm{cost}(A|R_0)]}{c_1\, \mathbb{E}[\tau_1]} \;\ge\; \frac{\sum_i c_i \mu_i}{c_1 \mu_1}\,(1 - 2k\rho). $$

It remains to observe that $c_1\, \mathbb{E}[\tau_1]$ is precisely the expected total cost of the deterministic benchmark. □
Competitive ratio of VirtUCB. Consider the case of two options and uniform costs. Then (assuming $R_0$
is gap-decreasing) the approximate best crowd $i^*$ in Theorem 6.2 is the best crowd. The competitive ratio
of VirtUCB is, in the notation of Theorem 6.2, at most $1 + \frac{O(\Lambda \log T)}{\mathrm{cost}(i^*|R_0)}$. This factor is close to 1 when $R_0$ is
tuned so as to decrease the error rate at the expense of increasing the expected running time.

7 The randomized benchmark 

In this section we further discuss the randomized benchmark for crowd-selection algorithms. Informally, it is 
the best randomized time-invariant policy given the latent information (response distributions of the crowds). 
Formally this benchmark is defined as $\min_\mu \mathrm{cost}(\mu|R_0)$, where the minimum is over all distributions $\mu$ over
crowds, and $R_0$ is a fixed single-parameter stopping rule. Recall that in the definition of $\mathrm{cost}(\mu|R_0)$, the
total crowd is treated as a single data source to which $R_0$ is applied.

The total crowd under a given $\mu$ behaves as a single crowd whose response distribution $\mathcal{D}_\mu$ is given by
$\mathcal{D}_\mu(x) = \mathbb{E}_{i \sim \mu}[\mathcal{D}_i(x)]$ for all options $x$. The gap of $\mathcal{D}_\mu$ will henceforth be called the induced gap of $\mu$ and
denoted $f(\mu) = \epsilon(\mathcal{D}_\mu)$. If the costs are uniform then $\mathrm{cost}(\mu|R_0)$ is simply the expected stopping time of
$R_0$ on $\mathcal{D}_\mu$, which we denote $\tau(\mathcal{D}_\mu)$. Informally, $\tau(\mathcal{D}_\mu)$ is driven by the induced gap of $\mu$.

We show that the induced gap can be much larger than the gap of any crowd. 

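Lemma 7.1. Let $\mu$ be the uniform distribution over crowds. For any $\epsilon > 0$ there exists a problem instance
such that the gap of each crowd is $\epsilon$, and the induced gap of $\mu$ is at least $\frac{1}{10}$.

Proof. The problem instance is quite simple: there are two crowds and three options, and the response
distributions are $(\frac{2}{5} + \epsilon, \frac{2}{5}, \frac{1}{5} - \epsilon)$ and $(\frac{2}{5} + \epsilon, \frac{1}{5} - \epsilon, \frac{2}{5})$. Then $\mathcal{D}_\mu = (\frac{2}{5} + \epsilon, \frac{3}{10} - \frac{\epsilon}{2}, \frac{3}{10} - \frac{\epsilon}{2})$. □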
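The construction can be checked with exact arithmetic. In the sketch below the base probabilities 2/5 and 1/5 are an assumption (the fractions are hard to read in the source); any instance of the form (a + eps, a, b - eps) and (a + eps, b - eps, a) with 2a + b = 1 and a > b behaves the same way. The helper names are hypothetical.

```python
from fractions import Fraction

def gap(dist):
    """Probability of the most likely option minus that of the runner-up."""
    top, second = sorted(dist, reverse=True)[:2]
    return top - second

def mixture(dists, mu):
    """Response distribution of the total crowd: sum_i mu_i * D_i(x) per option x."""
    n_options = len(dists[0])
    return [sum(m * d[x] for m, d in zip(mu, dists)) for x in range(n_options)]

eps = Fraction(1, 100)
a, b = Fraction(2, 5), Fraction(1, 5)   # assumed base probabilities
crowd1 = [a + eps, a, b - eps]
crowd2 = [a + eps, b - eps, a]
uniform = [Fraction(1, 2), Fraction(1, 2)]

induced = gap(mixture([crowd1, crowd2], uniform))
print(gap(crowd1), gap(crowd2), induced)
```

Each crowd has gap `eps`, while the mixture's gap is `(a - b) / 2 + 3 * eps / 2`, which stays bounded away from zero as `eps` shrinks.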

We conclude that the randomized benchmark does not reduce to the deterministic benchmark: in fact,
it can be much stronger. Formally, this follows from Lemma 7.1 under a very mild assumption on $R_0$:
that for any response distribution $\mathcal{D}$ with gap $\frac{1}{10}$ or more, and any response distribution $\mathcal{D}'$ whose gap is
sufficiently small, it holds that $\tau(\mathcal{D}) \ll \tau(\mathcal{D}')$. The implication for the design of crowd-selection algorithms
is that algorithms that zoom in on the best crowd may be drastically suboptimal. Instead, for some problem
instances the right goal is to optimize over distributions over crowds.

However, the randomized benchmark coincides with the deterministic benchmark for some important
special cases. First, the two benchmarks coincide if the costs are uniform and all crowds agree on the top
two options (and $R_0$ is gap-decreasing). Second, the two benchmarks may coincide if there are only two
options ($|O| = 2$), see Lemma 7.2 below. To prove this lemma for non-uniform costs, one needs to explicitly
consider $\mathrm{cost}(\mu|R_0)$ rather than just argue about the induced gaps. Our proof assumes that the expected
stopping time of $R_0$ is a concave function of the gap; it is not clear whether this assumption is necessary.

Lemma 7.2. Consider the bandit survey problem with two options ($|O| = 2$). Consider a symmetric single-crowd
stopping rule $R_0$. Assume that the expected stopping time of $R_0$ on response distribution $\mathcal{D}$ is a
concave function of $\epsilon(\mathcal{D})$. Then the randomized benchmark coincides with the deterministic benchmark.
That is, $\mathrm{cost}(\mu|R_0) \ge \min_i \mathrm{cost}(i|R_0)$ for any distribution $\mu$ over crowds.

Proof. Let $\mu$ be an arbitrary distribution over crowds. Recall that $f(\mu)$ denotes the induced gap of $\mu$. Note
that $f(\mu) = \mu \cdot \epsilon$. To see this, let $O = \{x, y\}$, where $x$ is the correct answer, and write

$$ \epsilon(\mathcal{D}_\mu) = \mathcal{D}_\mu(x) - \mathcal{D}_\mu(y) = \mu \cdot \mathcal{D}(x) - \mu \cdot \mathcal{D}(y) = \mu \cdot (\mathcal{D}(x) - \mathcal{D}(y)) = \mu \cdot \epsilon. $$
Let $A$ be the non-adaptive crowd-selection algorithm that corresponds to $\mu$. For each round $t$, let $i_t$ be
the crowd chosen by $A$ in this round, i.e. an independent sample from $\mu$. Let $N$ be the realized stopping
time of $A$. Let $\tau(\epsilon)$ be the expected stopping time of $R_0$ on a response distribution with gap $\epsilon$. Note that
$\mathbb{E}[N] = \tau(f(\mu))$. Therefore:

$$ \mathrm{cost}(A|R_0) = \mathbb{E}\left[\sum_{t=1}^{N} c_{i_t}\right] = \mathbb{E}[c_{i_t}]\; \mathbb{E}[N] \qquad \text{by Wald's identity} $$
$$ = (c \cdot \mu)\; \tau(\epsilon \cdot \mu) \;\ge\; (c \cdot \mu) \sum_i \mu_i\, \tau(\epsilon_i) \qquad \text{by concavity of } \tau(\cdot) $$
$$ \ge \min_i c_i\, \tau(\epsilon_i) \qquad \text{by Claim A.1} $$
$$ = \min_i \mathrm{cost}(i|R_0). $$

We have used a general fact that $(x \cdot \alpha)(x \cdot \beta) \ge \min_i \alpha_i \beta_i$ for any vectors $\alpha, \beta \in \mathbb{R}^k_+$ and any $k$-dimensional
distribution $x$. A self-contained proof of this fact can be found in the Appendix (Claim A.1). □
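The general fact used above is easy to sanity-check numerically. One way to see why it holds is the Cauchy-Schwarz inequality, which gives $(x \cdot \alpha)(x \cdot \beta) \ge (\sum_i x_i \sqrt{\alpha_i \beta_i})^2$, and a weighted average of the values $\sqrt{\alpha_i \beta_i}$ is at least their minimum. The Python sketch below only evaluates both sides; the function name is hypothetical.

```python
def product_and_bound(alpha, beta, x):
    """Return ((x . alpha) * (x . beta), min_i alpha_i * beta_i)."""
    dot_alpha = sum(xi * ai for xi, ai in zip(x, alpha))
    dot_beta = sum(xi * bi for xi, bi in zip(x, beta))
    lower = min(a * b for a, b in zip(alpha, beta))
    return dot_alpha * dot_beta, lower
```

In the proof, `alpha` plays the role of the cost vector and `beta` the vector of expected stopping times of the individual crowds.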



8 Crowd selection against the randomized benchmark 

We design a crowd-selection algorithm with guarantees against the randomized benchmark. We focus on
uniform costs, and (a version of) the single-crowd stopping rule from Section 4.

Our single-crowd stopping rule $R_0$ is as follows. Let $\hat\epsilon_{*,t}$ be the empirical gap of the total crowd. Then
$R_0$ stops upon reaching round $t$ if and only if

$$ \hat\epsilon_{*,t} > C_{\mathrm{qty}}/\sqrt{t} \quad \text{or} \quad t = T. \qquad (7) $$

Here $C_{\mathrm{qty}}$ is the "quality parameter" and $T$ is a given time horizon.
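A minimal Python sketch of this stopping test follows. It assumes the empirical gap is the difference between the empirical frequencies of the two most frequent answers so far; the function names are hypothetical, and `t >= horizon` is used as a defensive version of the condition t = T.

```python
import math
from collections import Counter

def empirical_gap(answers):
    """Frequency of the most common answer minus that of the runner-up."""
    t = len(answers)
    counts = sorted(Counter(answers).values(), reverse=True)
    top = counts[0]
    second = counts[1] if len(counts) > 1 else 0
    return (top - second) / t

def should_stop(answers, c_qty, horizon):
    """Stop when the empirical gap exceeds c_qty / sqrt(t), or at the horizon."""
    t = len(answers)
    if t == 0:
        return False
    return t >= horizon or empirical_gap(answers) > c_qty / math.sqrt(t)
```

For instance, with `c_qty = 1.0` a single answer is never enough (the gap 1 does not exceed 1), while four identical answers give gap 1 against a threshold of 1/2 and trigger a stop.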

Throughout this section, let $\mathcal{M}$ be the set of all distributions over crowds, and let $f^* = \max_{\mu \in \mathcal{M}} f(\mu)$
be the maximal induced gap. The benchmark cost is then at least $\Omega((f^*)^{-2})$.

We design an algorithm $A$ such that $\mathrm{cost}(A|R_0)$ is upper-bounded by (essentially) a function of $f^*$,
namely $O\left((f^*)^{-(k+2)}\right)$. We interpret this guarantee as follows: we match the benchmark cost for a distribution
over crowds whose induced gap is $(f^*)^{(k+2)/2}$. By Lemma 7.1 the gap of the best crowd may be
much smaller, so this can be a significant improvement over the deterministic benchmark.

Theorem 8.1. Consider the bandit survey problem with uniform costs. Let $R_0$ be the single-crowd stopping
rule given by (7). There exists a crowd-selection algorithm $A$ such that

$$ \mathrm{cost}(A|R_0) \le O\left((f^*)^{-(k+2)}\right). $$

The proof of Theorem 8.1 relies on some properties of the induced gap: concavity and Lipschitz-continuity.
Concavity is needed for the reduction lemma (Lemma 8.3), and Lipschitz-continuity is used
to solve the MAB problem that we reduce to.

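Claim 8.2. Consider the induced gap $f(\mu)$ as a function on $\mathcal{M} \subset \mathbb{R}^k$. First, $f(\mu)$ is a concave function.
Second, $|f(\mu) - f(\mu')| \le n\,\|\mu - \mu'\|_1$ for any two distributions $\mu, \mu' \in \mathcal{M}$.

Proof. Let $\mu$ be a distribution over crowds. Then

$$ f(\mu) = \mathcal{D}_\mu(x^*) - \max_{x \ne x^*} \mathcal{D}_\mu(x) = \min_{x \ne x^*} \mu \cdot \left(\mathcal{D}(x^*) - \mathcal{D}(x)\right). \qquad (8) $$

Thus, $f(\mu)$ is concave as a minimum of concave functions. The second claim follows because
$(\mu - \mu') \cdot \left(\mathcal{D}(x^*) - \mathcal{D}(x)\right) \le n\,\|\mu - \mu'\|_1$ for each option $x$. □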
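Both properties can be probed numerically. The sketch below assumes the induced gap is computed as the minimum over options x other than x* of mu . (D(x*) - D(x)), matching the form in the proof; `induced_gap` and `random_distribution` are hypothetical helper names.

```python
import random

def induced_gap(mu, dists, correct):
    """min over x != correct of sum_i mu_i * (D_i(correct) - D_i(x))."""
    n_options = len(dists[0])
    return min(
        sum(m * (d[correct] - d[x]) for m, d in zip(mu, dists))
        for x in range(n_options) if x != correct
    )

def random_distribution(size, rng):
    """A random probability vector of the given size."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]
```

Since each term mu . (D(x*) - D(x)) is linear in mu, the minimum over x is concave, and each linear term changes by at most the L1 distance between the two distributions, which is within the stated bound.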
8.1 Proof of Theorem 8.1

Virtual rewards. Consider the MAB problem with virtual rewards, where arms correspond to distributions
$\mu$ over crowds, and the virtual reward is equal to the induced gap $f(\mu)$; call it the induced MAB problem.
The standard definition of regret is with respect to the best fixed arm, i.e. with respect to $f^*$. We interpret
an algorithm $A$ for the induced MAB problem as a crowd-selection algorithm: in each round $t$, the crowd is
sampled independently at random from the distribution $\mu_t \in \mathcal{M}$ chosen by $A$.

Lemma 8.3. Consider the bandit survey problem with uniform costs. Let $R_0$ be the single-crowd stopping
rule given by (7). Let $A$ be an MAB algorithm for the induced MAB instance. Suppose $A$ has regret
$O(t^{1-\gamma} \log T)$ with probability at least $1 - \frac{1}{T}$, where $\gamma \in (0, \frac{1}{2}]$. Then

$$ \mathrm{cost}(A|R_0) \le O\left((f^*)^{-1/\gamma} \sqrt{\log T}\right). $$


Proof. Let $\mu_t \in \mathcal{M}$ be the distribution chosen by $A$ in round $t$. Then the total crowd returns each option $x$
with probability $\mu_t \cdot \mathcal{D}(x)$, and this event is conditionally independent of the previous rounds given $\mu_t$.

Fix round $t$. Let $N_t(x)$ be the number of times option $x$ is returned up to time $t$ by the total crowd, and let
$\hat{\mathcal{D}}_t(x) = \frac{1}{t} N_t(x)$ be the corresponding empirical frequency. Note that

$$ \mathbb{E}\left[\hat{\mathcal{D}}_t(x)\right] = \bar\mu_t \cdot \mathcal{D}(x), \quad \text{where } \bar\mu_t = \frac{1}{t} \sum_{s=1}^{t} \mu_s. $$

The time-averaged distribution over crowds $\bar\mu_t$ is a crucial object that we will focus on from here onwards.
By the Azuma-Hoeffding inequality, for each $C > 0$ and each option $x \in O$ we have

$$ \Pr\left[\, \left|\hat{\mathcal{D}}_t(x) - \bar\mu_t \cdot \mathcal{D}(x)\right| \le \frac{C}{\sqrt{t}} \,\right] \ge 1 - e^{-\Omega(C^2)}. \qquad (9) $$

Let $\hat\epsilon_t = \epsilon(\hat{\mathcal{D}}_t)$ be the empirical gap of the total crowd. Taking the Union Bound in Equation (9) over all
options $x \in O$, we conclude that $\hat\epsilon_t$ is close to the induced gap of $\bar\mu_t$:

$$ \Pr\left[\, \left|\hat\epsilon_t - f(\bar\mu_t)\right| \le \frac{C}{\sqrt{t}} \,\right] \ge 1 - n\, e^{-\Omega(C^2)}, \quad \text{for each } C > 0. $$

In particular, $R_0$ stops at round $t$ with probability at least $1 - \frac{1}{T}$ as long as

$$ f(\bar\mu_t) \ge t^{-1/2} \left( C_{\mathrm{qty}} + O(\sqrt{\log T}) \right). \qquad (10) $$

By concavity of $f$, we have $f(\bar\mu_t) \ge \bar f_t$, where $\bar f_t = \frac{1}{t} \sum_{s=1}^{t} f(\mu_s)$ is the time-averaged virtual reward.
Now, $t \bar f_t$ is simply the total virtual reward by time $t$, which is close to $t f^*$ with high probability. Specifically,
the regret of $A$ by time $t$ is $R(t) = t\,(f^* - \bar f_t)$, and we are given a high-probability upper bound on $R(t)$.

Putting this all together, $f(\bar\mu_t) \ge \bar f_t \ge f^* - R(t)/t$. An easy computation shows that $f(\bar\mu_t)$ becomes
sufficiently large to trigger the stopping condition (10) for $t = O\left((f^*)^{-1/\gamma} \sqrt{\log T}\right)$. □


Solving the induced MAB problem. We derive a (possibly inefficient) algorithm for the induced MAB
instance. We treat $\mathcal{M}$ as a subset of $\mathbb{R}^k$, endowed with a metric $d(\mu, \mu') = n\,\|\mu - \mu'\|_1$. By Claim 8.2 the
induced gap $f(\mu)$ is Lipschitz-continuous with respect to this metric. Thus, in the induced MAB problem
arms form a metric space $(\mathcal{M}, d)$ such that the (expected) rewards are Lipschitz-continuous for this metric
space. MAB problems with this property are called Lipschitz MAB.

We need an algorithm for Lipschitz MAB that works with virtual rewards. We use the following simple
algorithm from [18, 19]. We treat $\mathcal{M}$ as a subset of $\mathbb{R}^k$, and apply this algorithm to $\mathbb{R}^k$. The algorithm runs
in phases $j = 1, 2, 3, \ldots$, each of a prescribed duration. Each phase $j$ is as follows. For some fixed parameter $\delta_j > 0$,
discretize the space uniformly with granularity $\delta_j$. Let $S_j$ be the resulting set of arms. Run bandit algorithm
UCB1 [3] on the arms in $S_j$. (For each arm in $S_j \setminus \mathcal{M}$, assume that the reward is always 0.) This completes
the specification of the algorithm.
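The discretization step of one phase can be sketched as follows. This is only an illustration under assumptions: it discretizes the unit cube with spacing 1/steps, represents each grid point by integer numerators to avoid rounding issues, and treats a grid point as a valid arm when its coordinates sum to 1. The function names are hypothetical.

```python
import itertools

def grid_arms(k, steps):
    """All grid points of the unit cube in k dimensions, as integer numerators over `steps`."""
    return list(itertools.product(range(steps + 1), repeat=k))

def is_distribution(arm, steps):
    """True if the grid point is a probability vector (numerators sum to `steps`)."""
    return sum(arm) == steps

def to_probabilities(arm, steps):
    """Convert integer numerators into actual coordinates."""
    return [a / steps for a in arm]
```

Arms for which `is_distribution` is false correspond to grid points outside the set of distributions; in the algorithm described above their reward is taken to be 0, so a bandit algorithm run on the grid quickly stops choosing them.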

Crucially, we can implement UCB1 (and therefore the entire uniform algorithm) with virtual rewards, by
using the empirical gap $\hat\epsilon_t$ as an estimate for $f(\mu)$. Call the resulting crowd-selection algorithm VirtUniform.

Optimizing the $\delta_j$ using a simple argument from [18], we obtain regret $O(t^{1 - 1/(k+2)} \log T)$ with probability
at least $1 - \frac{1}{T}$. Therefore, by Lemma 8.3, the resulting bound on $\mathrm{cost}(\mathrm{VirtUniform}|R_0)$ suffices to prove Theorem 8.1.

We can also use a more sophisticated zooming algorithm from [19], which obtains the same regret in the worst
case, but achieves better regret for "nice" problem instances. This algorithm can also be implemented for
virtual rewards (in a similar way). However, it is not clear how to translate the improved regret bound for
the zooming algorithm into a better cost bound for the bandit survey problem.



9 Experimental results: single crowd 

We conduct two experiments. First, we analyze real-life workloads to find which gaps are typical for
response distributions that arise in practice. Second, we study the performance of the single-crowd stopping
rule suggested in Section 4 using a large-scale simulation with a realistic distribution of gaps. We are mainly
interested in the tradeoff between the error rate and the expected stopping time. We find that this tradeoff is
acceptable in practice.

Typical gaps in real-life workloads. We analyze several batches of microtasks extracted from a commercial 
crowdsourcing platform (approx. 3000 microtasks total). Each batch consists of microtasks of the same type, 
with the same instructions for the workers. Most microtasks are related to relevance assessments for a web 
search engine. Each microtask was given to at least 50 judges coming from the same "crowd". 

In every batch, the empirical gaps of the microtasks are very close to being uniformly distributed over
the range. A practical take-away is that assuming a Bayesian prior on the gap would not be very helpful,
which justifies and motivates our modeling choice not to assume Bayesian priors. In Figure 1 we provide
CDF plots for two of the batches; the plots for the other batches are similar.




[Figure 1 panels: (a) Batch 1: 128 microtasks, 2 options each. (b) Batch 2: 604 microtasks, variable number of options.]

Figure 1: CDF for the empirical gap in real-life workloads.
Our single-crowd stopping rule on simulated workloads. We study the performance of the single-crowd
stopping rule suggested in Section 4. Our simulated workload consists of 10,000 microtasks with two
options each. For each microtask, the gap is chosen independently and uniformly at random in the range
[0.05, 1]. This distribution of gaps is realistic according to the previous experiment. (Since there are only
two options, the gap fully describes the response distribution.)

We vary the parameter $C_{\mathrm{qty}}$ and for each $C_{\mathrm{qty}}$ we measure the average total cost (i.e., the stopping time
averaged over all microtasks) and the error rate. The results are reported in Figure 2. In particular, for this
workload, an error rate of < 5% can be obtained with an average of < 8 workers per microtask.
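An experiment of this kind can be sketched in a few lines of Python. The sketch is not the authors' code: it assumes the stopping rule has the form "stop when the empirical gap exceeds C_qty / sqrt(t), or at a horizon", that a worker answers correctly with probability (1 + gap) / 2, and that a tie at the horizon is not counted as an error. The function names, the default horizon and the seed are hypothetical choices.

```python
import math
import random

def run_microtask(gap, c_qty, horizon, rng):
    """Simulate one two-option microtask; return (workers used, answer was wrong)."""
    p_correct = (1.0 + gap) / 2.0
    correct = wrong = 0
    for t in range(1, horizon + 1):
        if rng.random() < p_correct:
            correct += 1
        else:
            wrong += 1
        emp_gap = abs(correct - wrong) / t
        if emp_gap > c_qty / math.sqrt(t) or t == horizon:
            return t, wrong > correct
    raise ValueError("horizon must be at least 1")

def simulate(n_tasks, c_qty, horizon=1000, seed=0):
    """Average cost and error rate over microtasks with gaps uniform in [0.05, 1]."""
    rng = random.Random(seed)
    total_cost = errors = 0
    for _ in range(n_tasks):
        gap = rng.uniform(0.05, 1.0)
        cost, is_wrong = run_microtask(gap, c_qty, horizon, rng)
        total_cost += cost
        errors += is_wrong
    return total_cost / n_tasks, errors / n_tasks
```

Calling `simulate(10000, c_qty)` for a range of `c_qty` values traces out a cost versus error-rate curve for this simplified model; larger values of `c_qty` demand more evidence before stopping, so they raise the cost and lower the error rate.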




[Figure 2 panels: (a) Average cost vs. error rate. (b) Average cost vs. $C_{\mathrm{qty}}$. (c) Average error rate vs. $C_{\mathrm{qty}}$.]

Figure 2: Our single-crowd stopping rule on the synthetic workload.

Our stopping rule adapts to the gap of the microtask: it uses only a few workers for easy microtasks
(ones with a large gap), and more workers for harder microtasks (those with a small gap). In particular, we
find that our stopping rule requires a significantly smaller number of workers than a non-adaptive stopping
rule: one that always uses the same number of workers while ensuring a desired error rate.



10 Experimental results: crowd-selection algorithms 

We study the experimental performance of the various crowd-selection algorithms discussed in Section 6.
Specifically, we consider algorithms VirtUCB and VirtThompson, and compare them to our straw-man
solutions: ExploreExploitRollback and RandRR.$^{10}$ Our goal is both to compare the different algorithms
and to show that the associated costs are practical. We find that ExploreExploitRollback consistently
outperforms RandRR for very small error rates, VirtUCB significantly outperforms both across all error rates,
and VirtThompson significantly outperforms all three.

$^{10}$ In the plots, we use shorter names for the algorithms: respectively, VRUCB, VR Thompson, EER, and RR.

We use all crowd-selection algorithms in conjunction with the composite stopping rule based on the
single-crowd stopping rule proposed in Section 4. Recall that the stopping rule has a "quality parameter" $C_{\mathrm{qty}}$
which implicitly controls the tradeoff between the error rate and the expected stopping time.

We use three simulated workloads. All three workloads consist of microtasks with two options, three
crowds, and unit costs. In the first workload, which we call the easy workload, the crowds have gaps
(0.3, 0, 0). That is, one crowd has gap 0.3 (so it returns the correct answer with probability 0.8), and
the remaining two crowds have gap 0 (so they provide no useful information). This is a relatively easy
workload for our crowd-selection algorithms because the best crowd has a much larger gap than the other
crowds, which makes the best crowd easier to identify. In the second workload, called the medium workload,
crowds have gaps (0.3, 0.1, 0.1), and in the third workload, called the hard workload, the crowds have gaps
(0.3, 0.2, 0.2). The third workload is hard(er) for the crowd-selection algorithms in the sense that the best
crowd is hard(er) to identify, because its gap is not much larger than the gap of the other crowds. The order
in which the crowds are presented to the algorithms is randomized for each instance, but is kept the same across
the different algorithms.

The quality of an algorithm is measured by the tradeoff between its average total cost and its error rate.
To study this tradeoff, we vary the quality parameter $C_{\mathrm{qty}}$ to obtain (essentially) any desired error rate. We
compare the different algorithms by reporting the average total cost of each algorithm (over 20,000 runs
with the same quality parameter) for a range of error rates. Specifically, for each error rate we report the
average cost of each algorithm normalized to the average cost of the naive algorithm RandRR (for the same
error rate). See Figure 3 for the main plot: the average cost vs. error rate plots for all three workloads.
Additional results, reported in Figure 4, show the raw average total costs and error rates for the
range of values of the quality parameter $C_{\mathrm{qty}}$.




[Figure 3 panels: (a) Easy: gaps (.3, 0, 0). (b) Medium: gaps (.3, .1, .1). (c) Hard: gaps (.3, .2, .2). Horizontal axis: error rate. Curves: EER, VRUCB, VR Thompson.]

Figure 3: Crowd-selection algorithms: error rate vs. average total cost (relative to RandRR).



For VirtUCB we tested different values of the parameter $C$, which balances between exploration
and exploitation. We obtained the best results for a range of workloads with $C = 1$, and this is the value
we use in all the experiments. For VirtThompson we start with a uniform prior on each crowd.

Results and discussion. For the easy workload the cost of VirtUCB is about 60% to 70% of the cost
of RandRR. VirtThompson is significantly better, with a cost of about 40% of the cost of RandRR. For
the medium workload the cost of VirtUCB is about 80% to 90% of the cost of RandRR. VirtThompson is
significantly better, with a cost of about 70% of the cost of RandRR. For the hard workload the cost of VirtUCB
is about 90% to 100% of the cost of RandRR. VirtThompson is better, with a cost of about 80% to 90% of the
cost of RandRR. While our analysis predicts that ExploreExploitRollback should be (somewhat) better
than RandRR, our experiments do not confirm this for every error rate.

As the gap of the other crowds approaches that of the best crowd, choosing the best crowd becomes 
less important, and so the advantage of the adaptive algorithms over RandRR diminishes. In the extreme 
case where all crowds have the same gap, all the algorithms would perform the same, with an error rate 
that depends only on the stopping rule. We conclude that VirtUCB provides an advantage, and VirtThompson 
provides a significant advantage, over the naive scheme of RandRR. 






Additional plots for crowd-selection algorithms 

[Figure 4 plots omitted. For each of the three workloads, namely the easy workload with gaps (.3, 0, 0), the 
medium workload with gaps (.3, .1, .1), and the hard workload with gaps (.3, .2, .2), the panels plot average 
cost and error rate against Cqty, with legend entries RR, EER, VRUCB, and VR Thompson.] 
Figure 4: Crowd-selection algorithms: Average cost and error rate vs. Cqty. 

11 Open questions 



The bandit survey problem. The main open questions concern crowd-selection algorithms for the randomized 
benchmark. First, we do not know how to handle non-uniform costs. Second, we conjecture that our 
algorithm for uniform costs can be significantly improved. Moreover, it is desirable to combine guarantees 
against the randomized benchmark with (better) guarantees against the deterministic benchmark. 

Our results prompt several other open questions. First, while we obtain strong provable guarantees 
for VirtUCB, it is desirable to extend these or similar guarantees to VirtThompson, since this algorithm 
performs best in the experiments. Second, is it possible to significantly improve over the composite stopping 
rules? Third, is it advantageous to forgo our "independent design" approach and design the crowd-selection 
algorithms jointly with the stopping rules? 

Extended models. It is tempting to extend our model in several directions listed below. First, while in 
our model the gap of each crowd does not change over time, it is natural to study settings with bounded 
or "adversarial" change; one could hope to take advantage of the tools developed for the corresponding 
versions of MAB. Second, as discussed in the introduction, an alternative model worth studying is to assign 
a monetary penalty to a mistake, and optimize the overall cost (i.e., cost of labor minus penalty). Third, one 
can combine the bandit survey problem with learning across multiple related microtasks. 

A Missing proofs 

A vector inequality. In the proof of Lemma 7.2 we have used the following general vector inequality: 

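Claim A.1. $(x \cdot \alpha)(x \cdot \beta) \ge \min_i \alpha_i \beta_i$ for any vectors $\alpha, \beta \in \mathbb{R}^k_+$ and any $k$-dimensional distribution $x$. 

This inequality appears standard, although we have not been able to find a reference. We supply a 
self-contained proof below. 

Proof. W.l.o.g. assume $\alpha_1\beta_1 \le \alpha_2\beta_2 \le \cdots \le \alpha_k\beta_k$. Let us use induction on $k$, as follows. Let 

\[ f(x) \triangleq (x \cdot \alpha)(x \cdot \beta) = (x_1\alpha_1 + A)(x_1\beta_1 + B), \]

where 

\[ A = \sum_{i>1} x_i \alpha_i, \qquad B = \sum_{i>1} x_i \beta_i. \]

Denoting $p = x_1$, we can write the above expression as 

\[ f(x) = p^2\, \alpha_1\beta_1 + p\,(\alpha_1 B + \beta_1 A) + AB. \tag{11} \]

First, let us invoke the inductive hypothesis to handle the $AB$ term in Equation (11). Let $y_i = x_i/(1-p)$ and 
note that $\{y_i\}_{i>1}$ is a distribution. (Here $p < 1$; if $p = 1$ then $f(x) = \alpha_1\beta_1$ and the claim is immediate.) 
It follows that $AB/(1-p)^2 \ge \alpha_2\beta_2$. In particular, $AB \ge (1-p)^2\, \alpha_1\beta_1$. 

Next, let us handle the second summand in Equation (11). Let us re-write it to make things clearer: 

\[ \alpha_1 B + \beta_1 A = (1-p) \sum_{i>1} \left( \alpha_1 y_i \beta_i + \beta_1 y_i \alpha_i \right) = (1-p)\, \alpha_1\beta_1 \sum_{i>1} y_i \left[ \frac{\beta_i}{\beta_1} + \frac{\alpha_i}{\alpha_1} \right]. \]

We handle the term in big brackets using the assumption that $\alpha_1\beta_1 \le \alpha_i\beta_i$. By this assumption it follows 
that $\frac{\beta_i}{\beta_1} \ge \frac{\alpha_1}{\alpha_i}$ and therefore $\frac{\beta_i}{\beta_1} + \frac{\alpha_i}{\alpha_1} \ge \frac{\alpha_1}{\alpha_i} + \frac{\alpha_i}{\alpha_1} \ge 2$. Plugging this into the expression above and using $\sum_{i>1} y_i = 1$, we obtain 

\[ \alpha_1 B + \beta_1 A \ge 2(1-p)\, \alpha_1\beta_1. \]

Finally, going back to Equation (11) we obtain 

\[ f(x) \ge p^2\, \alpha_1\beta_1 + 2p(1-p)\, \alpha_1\beta_1 + (1-p)^2\, \alpha_1\beta_1 = \alpha_1\beta_1. \qquad \square \]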
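
As a sanity check on the vector inequality, the following sketch tests it numerically on random instances. It is an illustration rather than part of the proof, and it assumes that the vectors have strictly positive entries, matching the ratios used in the argument. The function names, the sampling ranges, and the tolerance are arbitrary choices made for this sketch.

```python
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def random_distribution(k, rng):
    """A random probability vector of length k with positive entries."""
    weights = [1.0 - rng.random() for _ in range(k)]  # each in (0, 1]
    total = sum(weights)
    return [w / total for w in weights]


def claim_holds(alpha, beta, x, rel_tol=1e-9):
    """Check (x . alpha)(x . beta) >= min_i alpha_i * beta_i up to rounding."""
    lhs = dot(x, alpha) * dot(x, beta)
    rhs = min(a * b for a, b in zip(alpha, beta))
    return lhs >= rhs * (1.0 - rel_tol)


def check_random_instances(k, trials, seed=0):
    """Return True if no sampled instance violates the inequality."""
    rng = random.Random(seed)
    for _ in range(trials):
        alpha = [rng.uniform(0.01, 10.0) for _ in range(k)]
        beta = [rng.uniform(0.01, 10.0) for _ in range(k)]
        if not claim_holds(alpha, beta, random_distribution(k, rng)):
            return False
    return True
```

The bound is tight: when $x$ puts all its mass on a coordinate minimizing $\alpha_i\beta_i$, both sides coincide. Positivity matters, since with $\alpha = \beta = (1, -1)$ and $x = (1/2, 1/2)$ the left-hand side is 0 while the minimum product is 1.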
Comparison of worst-case error rates. Fix a single-crowd stopping rule $R_0$. We would like to argue that 
the worst-case error rate of an arbitrary crowd-selection rule A, when used with $R_0$, is not much smaller 
than the worst-case error rate of the benchmarks, i.e., that of $R_0$. 

We will need a mild assumption on A: essentially, that it never commits to stop using any given crowd. 
Formally, a crowd-selection algorithm A is called non-committing if for every problem instance, each time 
t, and every crowd i, it will choose crowd i at some time after t with probability one. (Here we consider a 
run of A that continues indefinitely, without being stopped by the stopping rule.) 

Lemma A.2. Let $R_0$ be a symmetric single-crowd stopping rule with worst-case error rate $p$. Let A be a 
non-committing crowd-selection algorithm, and let R be the composite stopping rule based on $R_0$ which 
does not use the total crowd. If A is used in conjunction with R, the worst-case error rate is at least 
$p(1 - 2kp)$, where $k$ is the number of crowds. 

Proof. Suppose $R_0$ attains the worst-case error rate for a crowd with gap $\epsilon$. Consider the problem instance 
in which one crowd (say, crowd 1) has gap $\epsilon$ and all other crowds have gap 0. Let $R^{(i)}$ be the instance of 
$R_0$ that takes inputs from crowd $i$, for each $i$. Let $E$ be the event that $R^{(i)}$ does not ever stop, for each $i > 1$. 
Let $E'$ be the event that $R^{(1)}$ stops and makes a mistake. These two events are independent, so the error 
rate of R is at least $\Pr[E] \cdot \Pr[E']$. By the choice of the problem instance, $\Pr[E'] = p$. And by Lemma 6.3, 
$\Pr[E] \ge 1 - 2kp$. It follows that the error rate of R is at least $p(1 - 2kp)$. $\square$ 
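
To see how close the lower bound p(1 − 2kp) of Lemma A.2 is to p itself, one can simply evaluate it. The sketch below does only this arithmetic; the function name and the sample values of p and k are illustrative assumptions, not figures from the experiments.

```python
def error_rate_lower_bound(p, k):
    """Evaluate p * (1 - 2*k*p), the lower bound from Lemma A.2.

    `p` is the worst-case error rate of the single-crowd stopping rule and
    `k` is the number of crowds. The bound is informative only when
    2*k*p < 1; otherwise it is non-positive and says nothing.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return p * (1.0 - 2.0 * k * p)


def relative_loss(p, k):
    """Fraction of p lost in the bound, i.e. 1 - bound / p = 2*k*p."""
    return 2.0 * k * p
```

For instance, with p = 0.05 and k = 3 the bound evaluates to 0.035, a 30% relative loss, while with p = 0.01 and k = 3 the loss is only 6%. The bound is therefore close to p whenever kp is small.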